Unicode Objects and Codecs

2023-06-06 22:26

Unicode Objects

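Since the implementation of PEP 393 in Python 3.3, Unicode objects internally use a variety of representations, in order to handle the complete range of Unicode characters while staying memory efficient. There are special cases for strings where all code points are below 128, 256, or 65536; otherwise, code points must be below 1114112 (which is the full Unicode range).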
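The thresholds above can be read as a rule for choosing the storage width of a string. The following is a minimal standalone sketch of that rule, not a CPython API: the helper name `bytes_per_code_point` is hypothetical, and it assumes that the below-128 and below-256 cases both store one byte per code point.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of the C API): number of bytes per code
 * point that a PEP 393 style string needs, given its largest code point. */
static int
bytes_per_code_point(uint32_t max_code_point)
{
    /* 1114112 == 0x110000, the first value outside the Unicode range. */
    assert(max_code_point < 1114112u);

    if (max_code_point < 256u) {
        return 1;   /* covers both the below-128 and below-256 cases */
    }
    if (max_code_point < 65536u) {
        return 2;
    }
    return 4;
}
```

For example, a string whose largest code point is U+20AC would need two bytes per code point under this rule, while one containing U+1F600 would need four.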

UTF-8 representation is created on demand and cached in the Unicode object.

Note

The Py_UNICODE representation and the deprecated APIs that relied on it were removed in Python 3.12. See PEP 623 for more information.

Unicode Type

These are the basic Unicode object types used for the Unicode implementation in Python:

type Py_UCS4
type Py_UCS2
type Py_UCS1
Part of the Stable ABI.

These types are typedefs for unsigned integer types wide enough to contain characters of 32 bits, 16 bits and 8 bits, respectively. When dealing with single Unicode characters, use Py_UCS4.

Added in version 3.3.

type Py_UNICODE

This is a typedef of wchar_t, which is a 16-bit type or 32-bit type depending on the platform.

Changed in version 3.3: In previous versions, this was a 16-bit type or a 32-bit type depending on whether you selected a "narrow" or "wide" Unicode version of Python at build time.

Deprecated since version 3.13, will be removed in version 3.15.

type PyASCIIObject
type PyCompactUnicodeObject
type PyUnicodeObject

These subtypes of PyObject represent a Python Unicode object. In almost all cases, they shouldn't be used directly, since all API functions that deal with Unicode objects take and return PyObject pointers.

Added in version 3.3.

PyTypeObject PyUnicode_Type
Part of the Stable ABI.

This instance of PyTypeObject represents the Python Unicode type. It is exposed to Python code as str.

The following APIs are C macros and static inlined functions for fast checks and access to internal read-only data of Unicode objects:

int PyUnicode_Check(PyObject *o)

Return true if the object o is a Unicode object or an instance of a Unicode subtype. This function always succeeds.

int PyUnicode_CheckExact(PyObject *o)

Return true if the object o is a Unicode object, but not an instance of a subtype. This function always succeeds.

int PyUnicode_READY(PyObject *o)

Returns 0. This API is kept only for backward compatibility.

Added in version 3.3.

Deprecated since version 3.10: This API does nothing since Python 3.12.

Py_ssize_t PyUnicode_GET_LENGTH(PyObject *o)

Return the length of the Unicode string, in code points. o has to be a Unicode object in the "canonical" representation (not checked).

Added in version 3.3.

Py_UCS1 *PyUnicode_1BYTE_DATA(PyObject *o)
Py_UCS2 *PyUnicode_2BYTE_DATA(PyObject *o)
Py_UCS4 *PyUnicode_4BYTE_DATA(PyObject *o)

Return a pointer to the canonical representation cast to UCS1, UCS2 or UCS4 integer types for direct character access. No check is made that the canonical representation actually has the matching character size; use PyUnicode_KIND() to select the right function.

Added in version 3.3.

PyUnicode_1BYTE_KIND
PyUnicode_2BYTE_KIND
PyUnicode_4BYTE_KIND

Return values of the PyUnicode_KIND() macro.

Added in version 3.3.

Changed in version 3.12: PyUnicode_WCHAR_KIND has been removed.

int PyUnicode_KIND(PyObject *o)

Return one of the PyUnicode kind constants (see above) that indicate how many bytes per character this Unicode object uses to store its data. o has to be a Unicode object in the "canonical" representation (not checked).

Added in version 3.3.

void *PyUnicode_DATA(PyObject *o)

Return a void pointer to the raw Unicode buffer. o has to be a Unicode object in the "canonical" representation (not checked).

Added in version 3.3.

void PyUnicode_WRITE(int kind, void *data, Py_ssize_t index, Py_UCS4 value)

Write a code point into the canonical representation buffer data (as obtained with PyUnicode_DATA()). This function performs no sanity checks and is intended for use in loops. The caller should cache the kind value and data pointer as obtained from other calls. index is the index in the string (starting at 0) and value is the new code point value to write at that location.

Added in version 3.3.
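To make the kind-based dispatch concrete, here is a standalone sketch of what a write of this sort conceptually does. It is a model written for illustration, not CPython's source: the names `KIND_1BYTE`, `KIND_2BYTE`, `KIND_4BYTE` and `write_code_point` are hypothetical stand-ins, and the only property carried over from the text is that the kind says how many bytes each code point occupies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the PyUnicode_*_KIND constants. */
enum { KIND_1BYTE = 1, KIND_2BYTE = 2, KIND_4BYTE = 4 };

/* Store `value` at position `index` of a buffer whose elements are `kind`
 * bytes wide. No bounds check is done on `index`; the asserts only
 * document that `value` has to fit the element width. */
static void
write_code_point(int kind, void *data, size_t index, uint32_t value)
{
    switch (kind) {
    case KIND_1BYTE:
        assert(value <= 0xFFu);
        ((uint8_t *)data)[index] = (uint8_t)value;
        break;
    case KIND_2BYTE:
        assert(value <= 0xFFFFu);
        ((uint16_t *)data)[index] = (uint16_t)value;
        break;
    default:
        assert(kind == KIND_4BYTE);
        ((uint32_t *)data)[index] = value;
        break;
    }
}
```

Because the element width is picked per call from `kind`, a loop that fills a buffer should look up the kind and the data pointer once and reuse them, which is the caching the description recommends.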

Py_UCS4 PyUnicode_READ(int kind, void *data, Py_ssize_t index)

Read a code point from a canonical representation data (as obtained with PyUnicode_DATA()). No checks or ready calls are performed.

Added in version 3.3.

Py_UCS4 PyUnicode_READ_CHAR(PyObject *o, Py_ssize_t index)

Read a character from a Unicode object o, which must be in the "canonical" representation. This is less efficient than PyUnicode_READ() if you do multiple consecutive reads.

Added in version 3.3.

Py_UCS4 PyUnicode_MAX_CHAR_VALUE(PyObject *o)

Return the maximum code point that is suitable for creating another string based on o, which must be in the "canonical" representation. This is always an approximation but more efficient than iterating over the string.

Added in version 3.3.

int PyUnicode_IsIdentifier(PyObject *o)
Part of the Stable ABI.

Return 1 if the string is a valid identifier according to the language definition, section Identifiers and keywords. Return 0 otherwise.

Changed in version 3.9: The function does not call Py_FatalError() anymore if the string is not ready.

Unicode Character Properties

Unicode provides many different character properties. The most often needed ones are available through these macros which are mapped to C functions depending on the Python configuration.

int Py_UNICODE_ISSPACE(Py_UCS4 ch)

Return 1 or 0 depending on whether ch is a whitespace character.

int Py_UNICODE_ISLOWER(Py_UCS4 ch)

Return 1 or 0 depending on whether ch is a lowercase character.

int Py_UNICODE_ISUPPER(Py_UCS4 ch)

Return 1 or 0 depending on whether ch is an uppercase character.

int Py_UNICODE_ISTITLE(Py_UCS4 ch)

Return 1 or 0 depending on whether ch is a titlecase character.

int Py_UNICODE_ISLINEBREAK(Py_UCS4 ch)

Return 1 or 0 depending on whether ch is a linebreak character.

int Py_UNICODE_ISDECIMAL(Py_UCS4 ch)

Return 1 or 0 depending on whether ch is a decimal character.

int Py_UNICODE_ISDIGIT(Py_UCS4 ch)

Return 1 or 0 depending on whether ch is a digit character.

int Py_UNICODE_ISNUMERIC(Py_UCS4 ch)

Return 1 or 0 depending on whether ch is a numeric character.

int Py_UNICODE_ISALPHA(Py_UCS4 ch)

Return 1 or 0 depending on whether ch is an alphabetic character.

int Py_UNICODE_ISALNUM(Py_UCS4 ch)

Return 1 or 0 depending on whether ch is an alphanumeric character.

int Py_UNICODE_ISPRINTABLE(Py_UCS4 ch)

Return 1 or 0 depending on whether ch is a printable character. Nonprintable characters are those characters defined in the Unicode character database as "Other" or "Separator", excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)

These APIs can be used for fast direct character conversions:

Py_UCS4 Py_UNICODE_TOLOWER(Py_UCS4 ch)

Return the character ch converted to lower case.

Deprecated since version 3.3: This function uses simple case mappings.

Py_UCS4 Py_UNICODE_TOUPPER(Py_UCS4 ch)

Return the character ch converted to upper case.

Deprecated since version 3.3: This function uses simple case mappings.

Py_UCS4 Py_UNICODE_TOTITLE(Py_UCS4 ch)

Return the character ch converted to title case.

Deprecated since version 3.3: This function uses simple case mappings.

int Py_UNICODE_TODECIMAL(Py_UCS4 ch)

Return the character ch converted to a decimal positive integer. Return -1 if this is not possible. This function does not raise exceptions.

int Py_UNICODE_TODIGIT(Py_UCS4 ch)

Return the character ch converted to a single digit integer. Return -1 if this is not possible. This function does not raise exceptions.

double Py_UNICODE_TONUMERIC(Py_UCS4 ch)

Return the character ch converted to a double. Return -1.0 if this is not possible. This function does not raise exceptions.

These APIs can be used to work with surrogates:

int Py_UNICODE_IS_SURROGATE(Py_UCS4 ch)

Check if ch is a surrogate (0xD800
